Episode 1: Ride to Glory

Introduction and Exploratory Data Analysis

The citi.bike bike share program in New York City is looking to add two more bike share locations. These sites should be chosen based on two goals: raising overall traffic (among both subscribers and non-subscribers) and increasing female traffic.

We began by using classification trees to identify the variable most important for determining whether a rider is a Customer (non-subscriber) or a Subscriber. Our trees showed that trip duration was the most important classification variable. Using this information, we plotted the trip duration histogram below. The red line on the graph represents 30 minutes, after which the price increases. The histogram also exposed a set of outliers, which we defined as rides longer than 3600 seconds, or 1 hour. We found that Customers tended to stay near the 30-minute limit, as can be seen in the table below. Subscribers, however, tended to use bikes for shorter rides, mostly under 20 minutes.
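The duration thresholds described above can be sketched in a few lines of Python. This is only an illustration, not the analysis code itself: the record layout, the function name `summarize_durations`, and the sample trips are hypothetical. Only the 30-minute price threshold and the 3600-second outlier cutoff come from the text.

```python
from statistics import median

PRICE_LIMIT_S = 30 * 60   # 30 minutes, after which the price increases
OUTLIER_S = 3600          # rides longer than 1 hour are treated as outliers


def summarize_durations(trips):
    """Summarize trip durations per user type.

    trips: iterable of (user_type, duration_seconds) pairs (assumed layout).
    Outliers are counted, then excluded from the median and the share.
    """
    by_type = {}
    for user_type, duration in trips:
        by_type.setdefault(user_type, []).append(duration)

    summary = {}
    for user_type, durations in by_type.items():
        kept = [d for d in durations if d <= OUTLIER_S]
        over_limit = sum(d > PRICE_LIMIT_S for d in kept)
        summary[user_type] = {
            "n": len(durations),
            "outliers": len(durations) - len(kept),
            "median_s": median(kept) if kept else None,
            "share_over_limit": over_limit / len(kept) if kept else None,
        }
    return summary


# Made-up sample records, for illustration only.
sample_trips = [
    ("Subscriber", 420), ("Subscriber", 780), ("Subscriber", 1100),
    ("Customer", 1500), ("Customer", 1750), ("Customer", 4000),
]
result = summarize_durations(sample_trips)
```

A per-type summary like this is one way to produce the kind of table referred to above, with outliers reported separately so they do not distort the typical ride length.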

We then theorized that usage might differ significantly across periods of the work day, with morning commuters, evening commuters, and leisure riders in the middle of the day. We plotted the top destinations and starting points on a geographical map of New York City. In the graph below, the red and yellow dots represent Customers, and the blue to green dots represent Subscribers. Using this classification and this graph, we identified how Subscribers, who make up the majority of our data, used bikes. We found short trips, often from bus and subway stations and other major traffic hubs to destinations less than 20 minutes away. With this information, we designed our model.
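The split of the work day into periods could be sketched as follows. The hour boundaries, the function names, and the sample trips are all assumptions made for illustration. The text names the three rider groups but does not say where the cutoffs between them were drawn.

```python
from collections import Counter
from datetime import datetime

# Hypothetical period boundaries (hour of day); the report does not specify them.
MORNING_HOURS = range(6, 10)    # 06:00-09:59
MIDDAY_HOURS = range(10, 16)    # 10:00-15:59
EVENING_HOURS = range(16, 20)   # 16:00-19:59


def period_of_day(start_time):
    """Map a trip start time to one of the assumed periods."""
    hour = start_time.hour
    if hour in MORNING_HOURS:
        return "morning commute"
    if hour in MIDDAY_HOURS:
        return "midday leisure"
    if hour in EVENING_HOURS:
        return "evening commute"
    return "off-peak"


def top_stations_by_period(trips, n=3):
    """trips: iterable of (start_time, station_name) pairs (assumed layout).

    Returns {period: [(station, count), ...]} with the n busiest stations.
    """
    counts = {}
    for start_time, station in trips:
        counts.setdefault(period_of_day(start_time), Counter())[station] += 1
    return {period: c.most_common(n) for period, c in counts.items()}


# Made-up sample records, for illustration only.
sample_trips = [
    (datetime(2014, 1, 6, 8, 15), "Hub station"),
    (datetime(2014, 1, 6, 8, 40), "Hub station"),
    (datetime(2014, 1, 6, 9, 5), "Side street"),
    (datetime(2014, 1, 6, 12, 30), "Park entrance"),
    (datetime(2014, 1, 6, 17, 45), "Office block"),
]
top = top_stations_by_period(sample_trips)
```

Ranking stations separately within each period is what makes it possible to compare commuter hubs against midday leisure spots before plotting them on a map.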

Modeling

To determine the optimal placement of the new bike stations, we needed a model that takes a candidate location and estimates how effective a station there would be. We measured effectiveness as total traffic through a bike station (number of riders starting at the location + number of riders ending at the location). We fit linear and generalized additive models to predict the total number of people who will start and end at a given latitude and longitude. Since the generalized additive models had a lower cross-validated mean squared error, we chose them over the linear models.
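The traffic measure defined above (starts plus ends) is simple to compute from raw trip records. The sketch below assumes each trip is a (start_station, end_station) pair. That layout, the function name `station_traffic`, and the station names are hypothetical.

```python
from collections import Counter


def station_traffic(trips):
    """Return total traffic per station: trips starting there + trips ending there.

    trips: iterable of (start_station, end_station) pairs (assumed layout).
    A round trip that starts and ends at the same station counts twice,
    once as a start and once as an end.
    """
    traffic = Counter()
    for start, end in trips:
        traffic[start] += 1
        traffic[end] += 1
    return traffic


# Made-up sample records, for illustration only.
sample_trips = [
    ("Station A", "Station B"),
    ("Station B", "Station C"),
    ("Station A", "Station A"),
]
traffic = station_traffic(sample_trips)
```

Joining these per-station totals to each station's latitude and longitude would give the response variable that a location-based model is fit to.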

We fit our models on three subsets of our dataset: women subscribers, non-male subscribers, and everyone. Since we found earlier in our analysis that traffic hubs such as Penn Station and Grand Central Terminal are places that many people like to take bikes to and from, we formed a small radius around each of Penn Station and Grand Central Terminal and predicted over those areas. From these predictions, we found the latitude and longitude corresponding to the highest predicted traffic for each of these subsets.
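The search over a small radius around each hub can be sketched as a grid search. Everything specific here is an assumption for illustration: the hub coordinates are approximate, the 1000-meter radius and the grid resolution are arbitrary, and `toy_predict` is a placeholder standing in for a fitted model, not the model itself.

```python
import math

# Approximate hub coordinates (assumed for illustration).
HUBS = {
    "Penn Station": (40.7506, -73.9935),
    "Grand Central Terminal": (40.7527, -73.9772),
}
EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidates_near(center, radius_m, steps=20):
    """Yield grid points that lie within radius_m of center."""
    lat0, lon0 = center
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    dlon = dlat / math.cos(math.radians(lat0))
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            lat = lat0 + dlat * i / steps
            lon = lon0 + dlon * j / steps
            if haversine_m(lat0, lon0, lat, lon) <= radius_m:
                yield lat, lon


def best_location(predict, radius_m=1000):
    """Return the candidate (lat, lon) with the highest predicted traffic."""
    points = (p for hub in HUBS.values() for p in candidates_near(hub, radius_m))
    return max(points, key=lambda p: predict(*p))


# Placeholder predictor, NOT the fitted model: highest at an arbitrary point.
PEAK = (40.7550, -73.9900)


def toy_predict(lat, lon):
    return -haversine_m(lat, lon, *PEAK)


best = best_location(toy_predict)
```

In practice, `predict` would be replaced by the model fit on each subset, so that the same search yields one best location per subset.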

Results and Conclusion

Using our generalized additive models, we identified two locations that we think best satisfy our two initial factors. The first site we chose was 41st and Dyer Ave (40.758023, -73.994736), near the Lincoln Tunnel. Two of our models, initial and gender, identified this location based on the coordinates of Penn Station and the surrounding area. There is a large gap between this site and other bike share stations, and it has access to CUNY and Port Authority. These two destinations are key because university students are traditionally a large bike-using group and, as we found in our exploratory data analysis, large traffic hubs are important locations for Subscribers.

The other site we identified was 67th and West Drive (40.7728, -73.9765) in Central Park. We identified this site through our initial and gender models. Since most Customers ride less than 25 minutes, we theorize that many Customers will stop in Central Park to relax and have a good time. This site is also along West Drive, which is the best way to traverse Central Park.

Below are graphical displays of the two spots we have chosen.